Abstract. We demonstrate that the principle of maximum relative entropy (ME), used judiciously, 
can ease the specification of priors in model selection problems. The resulting effect is that models 
that make sharp predictions are disfavoured, weakening the usual Bayesian "Occam's Razor". 
This is illustrated with a simple example involving what Jaynes called a "sure thing" hypothesis. 
Jaynes' resolution of the situation involved introducing a large number of alternative "sure thing" 
hypotheses that were possible before we observed the data. However, in more complex situations, 
it may not be possible to explicitly enumerate large numbers of alternatives. The entropic priors 
formalism produces the desired result without modifying the hypothesis space or requiring explicit 
enumeration of alternatives; all that is required is a good model for the prior predictive distribution 
for the data. This idea is illustrated with a simple rigged-lottery example, and we outline how this 
idea may help to resolve a recent debate amongst cosmologists: is dark energy a cosmological 
constant, or has it evolved with time in some way? And how shall we decide, when the data are in? 

1. INTRODUCTION 

In Bayesian model selection, we have two or more competing hypotheses, say $H_1$ and 
$H_2$, each possibly containing different parameters $\theta_1$ and $\theta_2$. We wish to judge 
the plausibility of these two hypotheses in the light of some data $D$, and some prior 
information $I$, dropped hereafter for succinctness. Bayes' rule provides the means to 
update our plausibilities of these two models, to take into account the data $D$: 

$$\frac{P(H_2|D)}{P(H_1|D)} = \frac{P(H_2)\,P(D|H_2)}{P(H_1)\,P(D|H_1)} = \frac{P(H_2)}{P(H_1)} \times \frac{\int p(\theta_2|H_2)\,p(D|\theta_2,H_2)\,d\theta_2}{\int p(\theta_1|H_1)\,p(D|\theta_1,H_1)\,d\theta_1} \tag{1}$$

Thus, the ratio of the posterior probabilities for the two models is the prior odds ratio 
times the evidence ratio. 
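
As a minimal sketch of how Equation 1 is used in practice, the snippet below multiplies a prior odds ratio by an evidence ratio. The function name `posterior_odds` and all of the numbers are hypothetical, chosen only for illustration; exact fractions are used so that the arithmetic is transparent.

```python
from fractions import Fraction


def posterior_odds(prior_h1, prior_h2, evidence_h1, evidence_h2):
    """Return P(H2|D) / P(H1|D) = prior odds * evidence ratio (Equation 1)."""
    prior_odds = prior_h2 / prior_h1
    evidence_ratio = evidence_h2 / evidence_h1
    return prior_odds * evidence_ratio


# Hypothetical evidences: the data favour H2 by a factor of 20.
z1, z2 = Fraction(1, 100), Fraction(1, 5)

# Equal prior probabilities: the posterior odds equal the evidence ratio.
equal_priors = posterior_odds(Fraction(1, 2), Fraction(1, 2), z1, z2)

# Prior odds of 1/20 against H2 cancel the evidence ratio exactly.
cancelled = posterior_odds(Fraction(20, 21), Fraction(1, 21), z1, z2)

print(equal_priors)  # 20
print(cancelled)     # 1
```

The second call shows why the evidence ratio alone does not settle a model comparison: the same data lead to even posterior odds once the prior odds are taken into account.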

If the various probabilities on the right-hand side of Equation 1 are a good description 
of our prior beliefs, then the posterior probabilities will encode justified conclusions 
based on the data. However, practical use of Equation 1 is often regarded with scepticism. 
This is primarily because the probabilities on the right-hand side are difficult 
to specify without making ad hoc choices. 

For reasons that are mostly historical, the prior distributions $p(\theta_1|H_1)$ and $p(\theta_2|H_2)$ 
for the parameters of each model are usually considered the most troubling. The prior 
model probabilities are often set to 1/2, citing symmetry, and the sampling distributions 
are usually considered uncontroversial. However, in many real scientific applications, 
assigning priors is trivial compared to the job of assigning sampling distributions (Hogg, 
priv. comm.), i.e. modelling how the question of interest would affect our data. 

While many Bayesians would assert that the dependence on subjective judgments 
exists because the result should actually depend on these judgments, it seems as though 
there ought to be ways to reduce the subjective influences in the prior probabilities and 
sampling distributions, even if they can never be entirely eliminated. In fact, this is the 
entire reason for using Bayes' rule in the first place. Rather than simply looking at 
the data and then assigning a posterior distribution directly, we make use of one objective 
thing we actually know, Bayes' rule. In this paper, we discuss how the principle of 
maximum relative entropy (ME) can be used to further reduce, though not eliminate, 
the subjectivity of Bayesian inferences. The key requirement of this approach is that we 
must have a realistic probabilistic model of our prior beliefs about the data, i.e. our prior 
predictive distribution for the data must be modelled carefully. 



1.1. Publishing the Evidence 

Skilling recommends that whenever some data are analysed using a model $M_1$, the 
evidence $Z_1 = p(D|M_1) = \int p(\theta_1)\,p(D|\theta_1)\,d\theta_1$ be presented. This way, anyone proposing 
a different model $M_2$ can calculate their own evidence $Z_2$ and carry out model comparison 
with Equation 1 without the need to recalculate $Z_1$, which was published by the first 
author. This is good advice that has been taken by many in the astronomical community 
[17, 18]. However, it is not the whole story. The plausibility of a model does not depend 
only on the evidence; it also depends on the prior probability (Equation 1). A large evidence 
ratio can easily be cancelled by a tiny prior probability ratio, and vice versa. The 
sure thing problem, discussed in Section 2, is a simple and well-known example of this 
fact. 



2. A SURE THING PROBLEM 

Suppose a simple lottery is held, with tickets numbered from 1 to 1,000,000. Each ticket 
is sold to a different person. Consider a hypothesis $H_1$, which states that the lottery is 
fair, and thus the probability of any particular ticket winning is $10^{-6}$. The draw is carried 
out, producing the following data $D$: the winner of the lottery was ticket #263878. Alice 
publishes a paper that reports this data, and proposes the fair lottery model $H_1$ to explain 
it. She presents the evidence $Z_1 = P(D|H_1) = 10^{-6}$. 

Bob, a professional rival of Alice, reads her paper and proposes a different model, $H_2$: 
the lottery was not fair; it was rigged in order to make ticket #263878 the winner. Bob 
writes a paper presenting the evidence $Z_2 = P(D|H_2) = 1$. Thus, he concludes, if $H_1$ and 
$H_2$ are initially equally plausible, the data make $H_2$ a million times more plausible than 
$H_1$. Clearly, something is not quite right with this conclusion. 



2.1. Jaynes' Solution: Introduce extra hypotheses 



Jaynes resolves the sure thing paradox in the following way. When Bob does a 
model selection between $H_1$ and $H_2$ with $P(H_1) = P(H_2) = 1/2$, he is implicitly stating 
that before getting the data, he would have predicted ticket #263878 with a probability 
greater than 50%. Clearly, there is no way he could have known this before seeing 
the data. Actually, before observing the data, there were 999,999 other "sure thing" 
hypotheses that were on an equal footing with $H_2$. The correct analysis would involve 
a bigger hypothesis space containing 1,000,001 hypotheses: $H_1$, and the 1,000,000 sure 
thing hypotheses $\{S_1, S_2, \ldots, S_{1000000}\}$, where $S_{263878} = H_2$. Bob should have assigned 
1/2 of the prior probability to $H_1$ and divided the other 1/2 evenly amongst the $S$'s. Then, 
the prior probability of $H_2$ is $5 \times 10^{-7}$ and its posterior probability is 1/2. This is the 
correct result; knowledge of the winning ticket number does not affect the plausibility 
of foul play. This argument resolves the sure thing problem by introducing a large 
number of alternatives into the hypothesis space, thus drastically reducing the prior 
probability of the particular sure thing hypothesis selected by the data. However, it is 
difficult to generalise this reasoning into more complicated scenarios where the principle 
of indifference cannot be used. 
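
The arithmetic behind both analyses can be checked directly. The sketch below is illustrative only: the variable names and the helper `sure_thing_likelihood` are hypothetical, and exact fractions are used to avoid rounding. It first reproduces Bob's naive odds, then repeats the calculation in the enlarged hypothesis space described above.

```python
from fractions import Fraction

N_TICKETS = 1_000_000
WINNER = 263_878


def sure_thing_likelihood(k, observed):
    """P(D = observed | S_k): S_k predicts ticket k with certainty."""
    return Fraction(1) if k == observed else Fraction(0)


# Bob's naive analysis: equal priors on H1 (fair) and H2 (rigged for WINNER).
like_h1 = Fraction(1, N_TICKETS)
naive_odds = (Fraction(1, 2) / Fraction(1, 2)) * (
    sure_thing_likelihood(WINNER, WINNER) / like_h1
)

# Enlarged hypothesis space: H1 keeps prior 1/2, and the remaining 1/2 is
# shared equally among the N_TICKETS sure thing hypotheses S_1 ... S_N.
prior_h1 = Fraction(1, 2)
prior_sure_thing = Fraction(1, 2) / N_TICKETS  # prior of each S_k

# Every S_k with k != WINNER has zero likelihood, so only S_WINNER (= H2)
# and H1 contribute to the normalising constant.
weight_h1 = prior_h1 * like_h1
weight_h2 = prior_sure_thing * sure_thing_likelihood(WINNER, WINNER)
marginal = weight_h1 + weight_h2

posterior_h1 = weight_h1 / marginal
posterior_h2 = weight_h2 / marginal

print(naive_odds)        # 1000000
print(prior_sure_thing)  # 1/2000000
print(posterior_h1)      # 1/2
print(posterior_h2)      # 1/2
```

The total probability of foul play stays at 1/2 before and after the draw; the data only identify which of the sure thing hypotheses could still be true.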

Before the lottery was drawn, Bob would have assigned a uniform predictive distribution 
for the data. His reanalysis ought to reflect this, if not by introducing extra sure 
thing models, then by downweighting $H_2$ somehow to reduce the spike it produces in 
the predictive distribution. While this is not the explicit motivation for entropic priors, it 
is a pleasant side effect, as we will show in the next section. 

3. ENTROPIC PRIORS 

In this section we introduce the notion of an entropic prior [3]. Usually, Bayesian 
inference is concerned with describing our knowledge in two stages: before taking into 
account the data, and then after taking into account the data. Bayes' rule is used to do 
this updating. Before taking into account the data, there is a prior distribution $p_1(\theta)$ and 
sampling distributions $p_1(D|\theta)$ for all $D$ and $\theta$. The reason for the subscript '1' will 
become clear later. By the product rule, this is equivalent to defining a joint prior on the 
product space of possible hypotheses and possible data: 

$$p_1(\theta, D) = p_1(\theta)\,p_1(D|\theta) \tag{2}$$

Here, the usual prior $p_1(\theta)$ (actually a marginal distribution) describes prior knowledge 
about $\theta$, and the sampling distributions $p_1(D|\theta)$ describe prior knowledge about how $\theta$ 
is related to the data $D$ that we plan to observe. The key point here is that before learning 
the data, we are uncertain both about the parameters and about the data: $p_1(\theta, D)$ should 
model this state of uncertainty. 

In this paper, we will be concerned with describing uncertain knowledge about $(\theta, D)$, 
so we will be using probability distributions on the product space. We will start from a 
joint prior $p_0(\theta, D)$ and update this distribution twice to obtain the final joint posterior. 



We thus describe knowledge at three stages, defined below. 

• Stage 0: Before we observe the data, or even know what the sampling distributions 
are. However, the parameter space and the data space have been defined, as well as 
priors over these spaces. At Stage 0, our knowledge is $p_0(\theta, D)$. 

• Stage 1: Also before we observe the data. However, we have now specified the 
sampling distributions $p(D|\theta)$ for all $\theta$ and $D$. At Stage 1, our knowledge is 
$p_1(\theta, D)$. 

• Stage 2: We now have the data. Our knowledge is $p_2(\theta, D)$. 

Updating from Stage 1 to Stage 2 is what we typically think of as Bayesian analysis. 
We prefer updating, rather than just writing down Stage 2 probabilities, because we get 
to use an objective updating rule, Bayes' rule. The idea behind entropic priors is to 
split up the process of assigning Stage 1 probabilities into two steps: assigning Stage 
0 probabilities, and then updating to Stage 1 using another objective updating rule, ME. 
There is a lot of confusion in the literature about the relationship between these 
two principles. However, there need not be any tension between them if it is understood 
that Bayes' rule is to be used when we learn about propositions built from those in the 
product space, such as '$D = 42$' or '$\theta + D < 1$', whereas ME applies to propositions 
about probability distributions on that space, such as '$p(\theta, D)$ should be a Gaussian'. 



3.1. Updating from Stage 0 to Stage 1 

Say we have a Stage 0 joint prior, and we don't know the sampling distributions yet. 
Perhaps we haven't calibrated the instruments to see what kinds of output they typically 
produce. At this point our knowledge of $(\theta, D)$ will often be independent, such that 
taking data before learning about the experiment does not tell us anything about the 
parameters. (It does tell us the data, and is therefore significant information in the product 
space; however, data is usually a nuisance parameter.) For generality, though, we 
will allow dependence in the Stage 0 distribution: $p_0(\theta, D) = p_0(\theta)\,p_0(D|\theta)$. 

We then learn information in the form of a constraint on allowable joint probability 
distributions: the sampling distributions $p(D|\theta)$ for all $\theta$ and $D$ are given to us: 
$p(D|\theta) = f(D; \theta)$, where $f$ is a given function. We must adjust our joint distribution so 
that this constraint is satisfied. By the rules of probability, any distribution of the form 
$p_1(\theta, D) = p_1(\theta)\,f(D; \theta)$ is allowed, and we have absolute freedom to vary $p_1(\theta)$ while 
still satisfying the constraint on the sampling distributions. However, there is a best 
choice for $p_1(\theta)$ [3]: it should be chosen such that $p_1(\theta, D)$ is as close as possible 
to $p_0(\theta, D)$, i.e. we choose the $p_1(\theta)$ that maximises the relative entropy 

$$S = -\int p_1(\theta, D)\,\log\frac{p_1(\theta, D)}{p_0(\theta, D)}\,d\theta\,dD \tag{3}$$



Footnote: This raises a philosophical point, as most information is ultimately in the form of data. However, we 
may summarise the resultant effect of a large amount of external data as providing a constraint on our 
probability distributions. 



[Figure 1 panels: $p_0(x, y)$, $p_1(x, y)$, $p_2(x, y)$] 
FIGURE 1. The basic idea behind entropic priors. An initially independent $N(0, 1)$ joint prior for two 
quantities $x$ and $y$ (to be thought of as "parameters" and "data" respectively) is updated once the sampling 
distributions $p(y|x)$ for all $y, x$ are known to be $p(y|x) = N(x, 1)$. When the data are known (in this case, $y = 0.5$), 
the joint distribution is updated again. This second updating is equivalent to the usual Bayesian process. 



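Differentiating with respect to each value of $p_1(\theta)$ (i.e. its value at each $\theta$) and setting 
the result to zero (with a Lagrange multiplier term added to enforce normalisation): 

$$\frac{\partial}{\partial p_1(\theta)}\left[S - \lambda\left(\int p_1(\theta)\,d\theta - 1\right)\right] = 0 \tag{5}$$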



Carrying out this calculation gives: 

$$p_1(\theta) \propto p_0(\theta)\,e^{S(\theta)} \tag{6}$$

where 

$$S(\theta) = -\int p_1(D|\theta)\,\log\frac{p_1(D|\theta)}{p_0(D|\theta)}\,dD \tag{7}$$
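
The intermediate steps between Equations 5 and 6 can be sketched as follows. This is the standard variational argument rather than a transcription of the authors' working: it assumes the constraint $p_1(\theta, D) = p_1(\theta)\,f(D; \theta)$ is substituted into Equation 3, that each sampling distribution is normalised ($\int f(D; \theta)\,dD = 1$), and that $S(\theta)$ is the relative entropy of $f(D; \theta)$ with respect to $p_0(D|\theta)$ as in Equation 7.

```latex
\begin{align*}
S &= -\int\!\!\int p_1(\theta)\, f(D;\theta)\,
      \log\frac{p_1(\theta)\, f(D;\theta)}{p_0(\theta)\, p_0(D|\theta)}\, dD\, d\theta \\
  &= -\int p_1(\theta)\, \log\frac{p_1(\theta)}{p_0(\theta)}\, d\theta
     + \int p_1(\theta)\, S(\theta)\, d\theta, \\[4pt]
0 &= \frac{\partial}{\partial p_1(\theta)}
     \left[ S - \lambda\left(\int p_1(\theta)\, d\theta - 1\right) \right]
   = -\log\frac{p_1(\theta)}{p_0(\theta)} - 1 + S(\theta) - \lambda, \\[4pt]
p_1(\theta) &= e^{-(1+\lambda)}\, p_0(\theta)\, e^{S(\theta)}
   \;\propto\; p_0(\theta)\, e^{S(\theta)}.
\end{align*}
```

The multiplier $\lambda$ only fixes the normalisation constant, so it never needs to be computed explicitly.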

Thus, the marginal for $\theta$ after learning the sampling distributions $p_1(D|\theta)$ is proportional 
to the original marginal $p_0(\theta)$, but multiplied by the exponential of the entropy of 
the corresponding sampling distribution relative to $p_0(D|\theta)$ (usually just $p_0(D)$). This 
process of updating from Stage 0 to Stage 1 using ME, and subsequently updating using 
data, is illustrated graphically in Figure 1. 
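
For the Gaussian example of Figure 1 the reweighting can be carried out in closed form. The worked sketch below assumes the setup stated in the figure caption, $p_0(x, y) = N(x; 0, 1)\,N(y; 0, 1)$ and $p(y|x) = N(y; x, 1)$, writes $N(\cdot\,; \mu, \sigma^2)$ for a Gaussian with mean $\mu$ and variance $\sigma^2$, and uses the standard result that the relative entropy between two unit-variance Gaussians is half the squared difference of their means.

```latex
\begin{align*}
S(x) &= -\int N(y; x, 1)\, \log\frac{N(y; x, 1)}{N(y; 0, 1)}\, dy = -\frac{x^2}{2}, \\[4pt]
p_1(x) &\propto p_0(x)\, e^{S(x)} \propto e^{-x^2/2}\, e^{-x^2/2} = e^{-x^2}
  \quad\Rightarrow\quad p_1(x) = N(x; 0, 1/2), \\[4pt]
p_2(x) &\propto p_1(x)\, p(y = 0.5\,|\,x)
  \propto \exp\!\left[-x^2 - \tfrac{1}{2}(x - 0.5)^2\right]
  \quad\Rightarrow\quad p_2(x) = N(x; 1/6, 1/3).
\end{align*}
```

For comparison, keeping the Stage 0 marginal $N(x; 0, 1)$ as the prior would give the posterior $N(x; 1/4, 1/2)$. The entropic prior concentrates probability on values of $x$ whose predictions for $y$ resemble the Stage 0 predictive distribution.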

Applying similar logic to model selection problems consists simply of applying the 
short-cut reasoning of the above paragraph: each hypothesis (model and its parameter 
value) gets its prior probability rescaled by a factor measuring the closeness of its 
predictions to our initial predictions. This makes it clear that invoking symmetry to 
assign $P(H_1) = P(H_2) = 1/2$ is flawed: the symmetry may be broken as soon as we assign 
the sampling distributions. 
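
For a discrete data space, this rescaling can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' implementation: the function names and the four-outcome toy problem are hypothetical, and the Stage 0 distribution is taken to be independent, so that $p_0(D|\theta) = p_0(D)$.

```python
import math


def relative_entropy(sampling, predictive):
    """S = -sum_D f(D) log(f(D) / p0(D)); terms with f(D) = 0 contribute zero."""
    return -sum(
        f * math.log(f / p0) for f, p0 in zip(sampling, predictive) if f > 0
    )


def entropic_prior(stage0_prior, sampling_dists, predictive):
    """Rescale each Stage 0 probability by exp(S) and renormalise (Equation 6)."""
    weights = [
        p * math.exp(relative_entropy(f, predictive))
        for p, f in zip(stage0_prior, sampling_dists)
    ]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy problem: four possible data values, uniform Stage 0 predictive.
predictive = [0.25, 0.25, 0.25, 0.25]
sampling_dists = [
    [0.25, 0.25, 0.25, 0.25],  # vague model: predicts exactly what we expected
    [1.0, 0.0, 0.0, 0.0],      # sharp model: all probability on one outcome
]

stage1_prior = entropic_prior([0.5, 0.5], sampling_dists, predictive)
print(stage1_prior)
```

In this toy problem the sharp model's entropy is $-\log 4$, so its weight is reduced by a factor of 4 and the Stage 1 probabilities come out at roughly 0.8 and 0.2, even though the Stage 0 probabilities were equal.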



4. SURE THING PROBLEM: ENTROPIC PRIORS SOLUTION 

For the lottery problem, our knowledge about the data before getting it, and before 
the two models have been specified, is described by a uniform distribution over the 
integers from 1 to $10^6$: $p_0(D) = 10^{-6}$ for all $D$. Because the two models haven't 
been specified yet, symmetry implies we must assign equal prior (marginal Stage 0) 
probabilities $p_0(H_1) = p_0(H_2) = 1/2$. The joint probabilities for the hypothesis and the 
data are therefore uniform and independent. 

The next step is to incorporate more information and update our probabilities to Stage 
1. This information is not the data, but the specification of the sampling distributions: 
$p(D|H_1) = 10^{-6}$ for all $D$, and $p(D|H_2) = 1$ if $D = 263878$ and zero otherwise. As 
explained in Section 3, the priors for the two hypotheses should be reweighted according 
to the exponential of the entropy of their sampling distributions with respect to the 
original predictive distribution. These entropies are 

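$$S(D|H_1) = -\sum_{D=1}^{10^6} 10^{-6}\,\log\frac{10^{-6}}{10^{-6}} = 0 \tag{8}$$

$$S(D|H_2) = -1 \times \log\frac{1}{10^{-6}} = \log\left(10^{-6}\right) \tag{9}$$

Thus, the solution to the lottery problem is: 

$$\frac{P(H_2|D)}{P(H_1|D)} = \frac{1/2}{1/2} \times \frac{e^{\log(10^{-6})}}{e^{0}} \times \frac{1}{10^{-6}} = 1 \tag{10}$$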




The three factors here are the Stage 0 odds ratio, the entropic correction factor, and 
the evidence ratio. The resulting conclusion is as it should be: knowing the winning 
lottery number provides no information about whether there is fraud or not. Usually, 
models that make sharp, correct predictions are favoured by Bayesian inference. In this 
example, this still occurs in the evidence ratio, but the entropic factor also penalizes 
$H_2$ by the same amount for being unjustifiably confident compared to our honest prior 
predictive distribution $p_0(D)$. 
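
The cancellation between the entropic factor and the evidence ratio can be verified numerically. The following sketch works in log space and uses the entropies of Equations 8 and 9; the variable names are hypothetical.

```python
import math

N_TICKETS = 10**6  # tickets numbered 1..N; Stage 0 predictive p0(D) = 1/N

# Factor 1: Stage 0 odds, p0(H2) / p0(H1) = (1/2) / (1/2).
log_stage0_odds = math.log(0.5 / 0.5)

# Factor 2: entropic correction, exp(S(D|H2)) / exp(S(D|H1)).
entropy_h1 = 0.0                    # Equation 8: H1 predicts exactly p0(D)
entropy_h2 = -math.log(N_TICKETS)   # Equation 9: -1 * log(1 / (1/N))
log_entropic_factor = entropy_h2 - entropy_h1

# Factor 3: evidence ratio, P(D|H2) / P(D|H1) = 1 / (1/N) = N.
log_evidence_ratio = math.log(N_TICKETS)

log_posterior_odds = log_stage0_odds + log_entropic_factor + log_evidence_ratio
posterior_odds = math.exp(log_posterior_odds)

print(math.exp(log_entropic_factor))  # about 1e-6: the prior penalty on H2
print(posterior_odds)                 # 1.0: the data favour neither model
```

The penalty on $H_2$ is the reciprocal of its evidence advantage, which matches the result of the enlarged hypothesis space in Section 2.1 without enumerating any alternative hypotheses.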



5. EVOLVING DARK ENERGY 



The nature of dark energy, thought to be responsible for causing the observed late-time 
accelerated expansion of the Universe [12, 13], is a key driver of many upcoming 
cosmological surveys and instruments [11]. From a model selection point of view, one 
of the key questions is whether the equation of state of dark energy, $w$, is exactly equal 
to minus one for all time or whether it has any temporal variation. The former case is 
equivalent to Einstein's cosmological constant, or non-zero vacuum energy, while any 
variation from this value, however small, indicates very different physics at play, such 
as the existence of a primordial scalar field, or other even more exotic possibilities. 
Here, model selection is really the key goal; the exact form of any evolution of $w$ is less 
interesting than simply being sure that it does evolve, or at least has a value different 
from the cosmological constant. There has been vigorous debate in the literature on how 
best to answer this question [5, 10]. Here we outline the contribution entropic priors 
can make to this debate; detailed analysis will be presented in a future contribution. 

We will consider four different models for how the dark energy equation of state has 
varied throughout the universe's history. Typically this is described as $w$ varying as a 
function of $a$, the scale factor. 

$H_0$: Cosmological Constant $\Lambda$: $w(a) = -1$ 

$H_1$: Constant, but non-$\Lambda$: $w(a) = w_0$ 

$H_2$: Simple Evolving: $w(a) = w_0 + (1 - a)\,w_a$ 

$H_3$: Complex Evolving: $w(a)$ = some model with many parameters 

Observations of type Ia supernovae, particularly how their apparent brightness decreases 
with redshift, are a strong probe of $w$. The idea is to test the four models, given such 
data, and to forecast the informativeness of proposed future missions. However, 
not all of the models are physically well-motivated: e.g. $H_0$ arises naturally from General 
Relativity, $H_1$ and $H_2$ are ad hoc "simple models", and $H_3$ expresses gross ignorance. 
Therefore, while it is fair to assign a large probability to $H_0$ at Stage 1, it does not 
automatically make sense to share the remaining probability evenly amongst $H_1$ to $H_3$. The 
reason for this is that $H_1$, $H_2$ and $H_3$ may imply quite different predictive distributions 
for the data. If we build a $p_0(D)$ that we trust, then the simple models $H_1$ and $H_2$ may 
be downgraded in prior probability solely because they make predictions that are too 
confident. Of course, if, in the course of building $p_0(D)$, we explicitly think about the 
predictions of the $H$'s, then this will not occur; entropic priors do not magically generate 
information. What they do is implore us to think about $p(D)$ when assigning priors, a 
key sanity check that is often overlooked. 



6. CONCLUSIONS 

Bayesian model selection is a difficult task, both computationally and philosophically. 
If we are not careful, we can obtain misleading results. The idea presented in this paper, 
of assigning a realistic predictive distribution for the data and then penalizing models 
whose predictions differ from it, should assist in making Bayesian model selection 
analyses more reliable. 



Footnote: The chance that an unknown function just happens to be a straight line is usually quite small, unless you 
have very good prior reasons to expect a straight line. 